The objective of this assignment is to examine Berlin Marathon race data, with a particular emphasis on split and finish times. The goal of the analysis is to understand how pacing tactics affect finish times. Pacing strategies such as even pacing, positive splits (first half faster), and negative splits (second half faster than the first) will be evaluated. Extreme positive splits, also referred to as "hitting the wall" or bonking, are also considered. The main aim is to extract pacing insights from the race data.
import warnings
from functools import reduce
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from scipy.stats import pearsonr, ttest_ind
from sklearn.linear_model import LinearRegression
primary_dataset = pd.read_csv('data/2013_data.csv')
secondary_dataset = pd.read_csv('data/2022_data.csv')
primary_dataset.head()
| | id | platz | nation | jahrgang | ak | ak_plazierung | sex | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 152 | 22753 | DEU | 1953 | 60 | 403 | M | 19040 | 00:29:30 | 00:59:24 | 01:28:59 | 01:59:07 | 02:29:48 | 02:59:52 | 03:30:06 | 04:00:18 | 02:05:56 | 04:13:55 | 04:29:54 |
| 1 | 153 | 22051 | DEU | 1969 | 40 | 753 | W | 3511 | 00:29:34 | 00:59:35 | 01:29:19 | 01:58:52 | 02:28:32 | 02:58:41 | 03:28:16 | 03:58:22 | 02:05:14 | 04:11:39 | 04:27:31 |
| 2 | 154 | 14033 | DEU | 1963 | 50 | 1512 | M | 12510 | 00:26:11 | 00:52:58 | 01:19:12 | 01:45:43 | 02:12:12 | 02:39:02 | 03:05:54 | 03:36:41 | 01:51:26 | 03:50:28 | 04:02:01 |
| 3 | 155 | 11137 | DEU | 1960 | 50 | 1183 | M | 10103 | 00:26:08 | 00:52:12 | 01:18:07 | 01:44:13 | 02:10:49 | 02:37:45 | 03:04:08 | 03:31:11 | 01:49:47 | 03:42:32 | 03:54:12 |
| 4 | 156 | 23030 | DEU | 1956 | 55 | 1077 | M | 19234 | 00:31:37 | 01:01:45 | 01:30:54 | 02:00:43 | 02:30:46 | 03:01:38 | 03:31:15 | 04:02:06 | 02:07:09 | 04:14:51 | 04:30:48 |
secondary_dataset.head()
| | id | platz | nation | jahrgang | ak | ak_plazierung | sex | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3381 | 1 | KEN | 1984 | 35 | 1 | M | 1 | 00:14:14 | 00:28:23 | 00:42:33 | 00:56:45 | 01:11:08 | 01:25:40 | 01:40:10 | 01:54:53 | 00:59:51 | 02:01:09 | 02:01:09 |
| 1 | 3382 | 2 | KEN | 1988 | 30 | 1 | M | 2 | 00:14:22 | 00:28:56 | 00:43:35 | 00:58:14 | 01:13:07 | 01:28:06 | 01:43:25 | 01:59:05 | 01:01:26 | 02:05:58 | 02:05:58 |
| 2 | 3383 | 3 | ETH | 1997 | H | 1 | M | 3 | 00:14:50 | 00:29:46 | 00:44:40 | 00:59:40 | 01:14:44 | 01:30:01 | 01:44:55 | 02:00:03 | 01:02:55 | 02:06:28 | 02:06:28 |
| 3 | 3384 | 4 | ETH | 1998 | H | 2 | M | 4 | 00:14:16 | 00:28:23 | 00:42:33 | 00:56:45 | 01:11:09 | 01:26:11 | 01:42:14 | 01:59:14 | 00:59:51 | 02:06:40 | 02:06:40 |
| 4 | 3385 | 5 | KEN | 1994 | H | 3 | M | 5 | 00:14:22 | 00:28:55 | 00:43:35 | 00:58:14 | 01:13:07 | 01:28:03 | 01:43:08 | 01:59:14 | 01:01:25 | 02:06:49 | 02:06:49 |
Here, we can see the columns and their respective values.
Note: the helper functions below convert the HH:MM:SS time columns to seconds and clean each dataset as it is loaded.
def hh_mm_ss2seconds(hh_mm_ss):
    # Convert an 'HH:MM:SS' string into total seconds
    return reduce(lambda acc, x: acc * 60 + x, map(int, hh_mm_ss.split(':')))
def clean_load_df(path, time_columns, dataset_name):
    # Load the CSV, converting every time column to seconds on read
    df = pd.read_csv(path, converters={col: hh_mm_ss2seconds for col in time_columns})
    # Normalise the remaining string columns to upper case
    df = df.apply(lambda x: x.str.upper() if x.dtype == 'object' else x)
    missing_values = df.isnull().sum()
    print(f"Missing values in {dataset_name} dataset:\n{missing_values}")
    # Drop rows containing missing values
    df = df.dropna()
    return df
time_columns = ['z5', 'z10', 'z15', 'z20', 'z25', 'z30', 'z35', 'z40', 'halbmarathon', 'netto', 'brutto']
primary_dataset = clean_load_df('data/2013_data.csv', time_columns, dataset_name='primary')
secondary_dataset = clean_load_df('data/2022_data.csv', time_columns, dataset_name='secondary')
Missing values in primary dataset: ak 9; all other columns 0 (dtype: int64)
Missing values in secondary dataset: ak 7; all other columns 0 (dtype: int64)
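As a quick sanity check that the converter matches the raw tables above (runner id=152's half-marathon split of 02:05:56 became 7556 seconds), here is a standalone version:

```python
from functools import reduce

def hh_mm_ss2seconds(hh_mm_ss):
    # Fold 'HH:MM:SS' into total seconds: ((H * 60) + M) * 60 + S
    return reduce(lambda acc, x: acc * 60 + x, map(int, hh_mm_ss.split(':')))

print(hh_mm_ss2seconds('02:05:56'))  # 7556
```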
# Exclude records with a half-marathon split of 0 (invalid/missing timing)
primary_dataset = primary_dataset[primary_dataset['halbmarathon'] != 0]
secondary_dataset = secondary_dataset[secondary_dataset['halbmarathon'] != 0]
print("After removing missing values from primary dataset:")
print(primary_dataset.isnull().sum())
print("\nAfter removing missing values from secondary dataset:")
print(secondary_dataset.isnull().sum())
After removing missing values from primary dataset: all columns 0 (dtype: int64)
After removing missing values from secondary dataset: all columns 0 (dtype: int64)
print("Primary Dataset:")
primary_dataset.head()
Primary Dataset:
| | id | platz | nation | jahrgang | ak | ak_plazierung | sex | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 152 | 22753 | DEU | 1953 | 60 | 403 | M | 19040 | 1770 | 3564 | 5339 | 7147 | 8988 | 10792 | 12606 | 14418 | 7556 | 15235 | 16194 |
| 1 | 153 | 22051 | DEU | 1969 | 40 | 753 | W | 3511 | 1774 | 3575 | 5359 | 7132 | 8912 | 10721 | 12496 | 14302 | 7514 | 15099 | 16051 |
| 2 | 154 | 14033 | DEU | 1963 | 50 | 1512 | M | 12510 | 1571 | 3178 | 4752 | 6343 | 7932 | 9542 | 11154 | 13001 | 6686 | 13828 | 14521 |
| 3 | 155 | 11137 | DEU | 1960 | 50 | 1183 | M | 10103 | 1568 | 3132 | 4687 | 6253 | 7849 | 9465 | 11048 | 12671 | 6587 | 13352 | 14052 |
| 4 | 156 | 23030 | DEU | 1956 | 55 | 1077 | M | 19234 | 1897 | 3705 | 5454 | 7243 | 9046 | 10898 | 12675 | 14526 | 7629 | 15291 | 16248 |
print("\nSecondary Dataset:")
secondary_dataset.head()
Secondary Dataset:
| | id | platz | nation | jahrgang | ak | ak_plazierung | sex | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3381 | 1 | KEN | 1984 | 35 | 1 | M | 1 | 854 | 1703 | 2553 | 3405 | 4268 | 5140 | 6010 | 6893 | 3591 | 7269 | 7269 |
| 1 | 3382 | 2 | KEN | 1988 | 30 | 1 | M | 2 | 862 | 1736 | 2615 | 3494 | 4387 | 5286 | 6205 | 7145 | 3686 | 7558 | 7558 |
| 2 | 3383 | 3 | ETH | 1997 | H | 1 | M | 3 | 890 | 1786 | 2680 | 3580 | 4484 | 5401 | 6295 | 7203 | 3775 | 7588 | 7588 |
| 3 | 3384 | 4 | ETH | 1998 | H | 2 | M | 4 | 856 | 1703 | 2553 | 3405 | 4269 | 5171 | 6134 | 7154 | 3591 | 7600 | 7600 |
| 4 | 3385 | 5 | KEN | 1994 | H | 3 | M | 5 | 862 | 1735 | 2615 | 3494 | 4387 | 5283 | 6188 | 7154 | 3685 | 7609 | 7609 |
Summary:
Here, I removed the missing values from both datasets (primary 2013 and secondary 2022) and excluded records with a half-marathon split of 0. A time-conversion function converts values in HH:MM:SS format to seconds. This data cleaning is required for the further analysis.
Summary Statistics:
Correlation Matrix:
# Summary statistics
summary_stats = primary_dataset.describe()
print(f"\nSummary Statistics for Primary dataset:")
summary_stats
Summary Statistics for Primary dataset:
| | id | platz | jahrgang | ak_plazierung | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 | 36452.000000 |
| mean | 18865.656974 | 18281.564359 | 1970.093740 | 1631.857868 | 11508.597196 | 1660.198069 | 3336.205476 | 5005.339542 | 6700.602656 | 8422.562246 | 10214.364068 | 12023.310847 | 13899.495775 | 7071.863958 | 14719.763881 | 15652.852573 |
| std | 10779.508291 | 10554.363264 | 9.972185 | 1330.486694 | 8088.344200 | 244.063959 | 503.183120 | 771.816290 | 1033.117122 | 1355.438898 | 1653.749961 | 1989.313759 | 2353.713212 | 1087.890709 | 2468.412850 | 2933.840603 |
| min | 152.000000 | 1.000000 | 1928.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3692.000000 | 7403.000000 | 0.000000 |
| 25% | 9527.500000 | 9142.750000 | 1963.000000 | 549.750000 | 4577.750000 | 1492.000000 | 2994.000000 | 4484.750000 | 5991.000000 | 7510.000000 | 9061.000000 | 10636.000000 | 12276.000000 | 6317.000000 | 12994.000000 | 13532.000000 |
| 50% | 18869.500000 | 18280.500000 | 1970.000000 | 1244.500000 | 9303.500000 | 1654.000000 | 3321.000000 | 4972.500000 | 6645.000000 | 8342.500000 | 10081.000000 | 11824.000000 | 13617.000000 | 7008.000000 | 14381.000000 | 15411.000000 |
| 75% | 28193.250000 | 27426.250000 | 1977.000000 | 2495.000000 | 18440.250000 | 1818.000000 | 3651.000000 | 5483.000000 | 7342.250000 | 9260.000000 | 11239.000000 | 13259.000000 | 15357.000000 | 7746.000000 | 16246.000000 | 17620.000000 |
| max | 37534.000000 | 36568.000000 | 1995.000000 | 5284.000000 | 27577.000000 | 3096.000000 | 5615.000000 | 8361.000000 | 11076.000000 | 14280.000000 | 17369.000000 | 20785.000000 | 23077.000000 | 11721.000000 | 25321.000000 | 26231.000000 |
# Correlation matrix
correlation_matrix = primary_dataset.corr(numeric_only=True)
print(f"\nCorrelation Matrix for Primary dataset:")
correlation_matrix
Correlation Matrix for Primary dataset:
| | id | platz | jahrgang | ak_plazierung | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.000000 | 0.015583 | 0.015831 | 0.002472 | 0.008980 | 0.018643 | 0.016722 | 0.017299 | 0.016094 | 0.017785 | 0.017186 | 0.017379 | 0.016477 | 0.017168 | 0.019194 | 0.019727 |
| platz | 0.015583 | 1.000000 | -0.161033 | 0.471429 | 0.651280 | 0.897869 | 0.903838 | 0.903334 | 0.929208 | 0.920572 | 0.947441 | 0.954216 | 0.948403 | 0.936801 | 0.968782 | 0.969400 |
| jahrgang | 0.015831 | -0.161033 | 1.000000 | 0.028964 | -0.189074 | -0.142190 | -0.150286 | -0.154424 | -0.164262 | -0.164859 | -0.166782 | -0.166461 | -0.161973 | -0.164773 | -0.168933 | -0.141555 |
| ak_plazierung | 0.002472 | 0.471429 | 0.028964 | 1.000000 | 0.772383 | 0.383733 | 0.381237 | 0.379681 | 0.388801 | 0.386208 | 0.405421 | 0.415887 | 0.428671 | 0.391872 | 0.438931 | 0.441337 |
| sex_plazierung | 0.008980 | 0.651280 | -0.189074 | 0.772383 | 1.000000 | 0.535474 | 0.535983 | 0.539206 | 0.554399 | 0.553095 | 0.580128 | 0.594813 | 0.607089 | 0.559569 | 0.623911 | 0.618011 |
| z5 | 0.018643 | 0.897869 | -0.142190 | 0.383733 | 0.535474 | 1.000000 | 0.962696 | 0.947572 | 0.959359 | 0.932164 | 0.933605 | 0.919731 | 0.886981 | 0.963415 | 0.904350 | 0.919598 |
| z10 | 0.016722 | 0.903838 | -0.150286 | 0.381237 | 0.535983 | 0.962696 | 1.000000 | 0.950056 | 0.965781 | 0.939212 | 0.943457 | 0.930165 | 0.897377 | 0.970306 | 0.915656 | 0.927912 |
| z15 | 0.017299 | 0.903334 | -0.154424 | 0.379681 | 0.539206 | 0.947572 | 0.950056 | 1.000000 | 0.962947 | 0.941839 | 0.947245 | 0.936219 | 0.902245 | 0.968083 | 0.919180 | 0.927662 |
| z20 | 0.016094 | 0.929208 | -0.164262 | 0.388801 | 0.554399 | 0.959359 | 0.965781 | 0.962947 | 1.000000 | 0.967556 | 0.973447 | 0.965500 | 0.930205 | 0.992844 | 0.951169 | 0.956477 |
| z25 | 0.017785 | 0.920572 | -0.164859 | 0.386208 | 0.553095 | 0.932164 | 0.939212 | 0.941839 | 0.967556 | 1.000000 | 0.963344 | 0.955607 | 0.924137 | 0.972463 | 0.946458 | 0.947570 |
| z30 | 0.017186 | 0.947441 | -0.166782 | 0.405421 | 0.580128 | 0.933605 | 0.943457 | 0.947245 | 0.973447 | 0.963344 | 1.000000 | 0.980771 | 0.958459 | 0.981577 | 0.974426 | 0.972102 |
| z35 | 0.017379 | 0.954216 | -0.166461 | 0.415887 | 0.594813 | 0.919731 | 0.930165 | 0.936219 | 0.965500 | 0.955607 | 0.980771 | 1.000000 | 0.965028 | 0.971846 | 0.984223 | 0.977729 |
| z40 | 0.016477 | 0.948403 | -0.161973 | 0.428671 | 0.607089 | 0.886981 | 0.897377 | 0.902245 | 0.930205 | 0.924137 | 0.958459 | 0.965028 | 1.000000 | 0.938668 | 0.969978 | 0.961539 |
| halbmarathon | 0.017168 | 0.936801 | -0.164773 | 0.391872 | 0.559569 | 0.963415 | 0.970306 | 0.968083 | 0.992844 | 0.972463 | 0.981577 | 0.971846 | 0.938668 | 1.000000 | 0.960078 | 0.964610 |
| netto | 0.019194 | 0.968782 | -0.168933 | 0.438931 | 0.623911 | 0.904350 | 0.915656 | 0.919180 | 0.951169 | 0.946458 | 0.974426 | 0.984223 | 0.969978 | 0.960078 | 1.000000 | 0.988371 |
| brutto | 0.019727 | 0.969400 | -0.141555 | 0.441337 | 0.618011 | 0.919598 | 0.927912 | 0.927662 | 0.956477 | 0.947570 | 0.972102 | 0.977729 | 0.961539 | 0.964610 | 0.988371 | 1.000000 |
# Summary statistics
summary_stats = secondary_dataset.describe()
print(f"\nSummary Statistics for Secondary dataset:")
summary_stats
Summary Statistics for Secondary dataset:
| | id | platz | jahrgang | ak_plazierung | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 | 34724.000000 |
| mean | 20753.856612 | 17376.021397 | 1978.902114 | 1264.282456 | 9678.847713 | 1668.268777 | 3359.813414 | 5063.242656 | 6792.635756 | 8578.037928 | 10432.448047 | 12327.157528 | 14307.832479 | 7170.832796 | 15140.201100 | 15487.002995 |
| std | 10030.074395 | 10031.989265 | 10.683553 | 934.498868 | 6435.072950 | 282.526120 | 587.091829 | 907.490698 | 1255.230187 | 1638.801193 | 2064.926889 | 2512.857196 | 2958.812524 | 1324.205628 | 3098.147153 | 3127.074812 |
| min | 3381.000000 | 1.000000 | 1935.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3591.000000 | 7269.000000 | 7269.000000 |
| 25% | 12068.750000 | 8688.750000 | 1971.000000 | 493.000000 | 4343.750000 | 1474.000000 | 2951.000000 | 4433.000000 | 5922.750000 | 7430.000000 | 8966.750000 | 10537.000000 | 12188.000000 | 6246.000000 | 12905.000000 | 13312.750000 |
| 50% | 20754.500000 | 17376.500000 | 1979.000000 | 1089.000000 | 8685.500000 | 1658.000000 | 3328.000000 | 5002.000000 | 6690.000000 | 8412.000000 | 10182.000000 | 11997.000000 | 13927.500000 | 7057.000000 | 14743.000000 | 15119.000000 |
| 75% | 29439.250000 | 26063.250000 | 1987.000000 | 1827.000000 | 14557.250000 | 1856.000000 | 3743.000000 | 5639.000000 | 7577.000000 | 9586.000000 | 11692.000000 | 13871.250000 | 16166.000000 | 7992.000000 | 17101.000000 | 17489.000000 |
| max | 38125.000000 | 34752.000000 | 2004.000000 | 3826.000000 | 23246.000000 | 4858.000000 | 6957.000000 | 9551.000000 | 12893.000000 | 16454.000000 | 19901.000000 | 23398.000000 | 26892.000000 | 13648.000000 | 28385.000000 | 28726.000000 |
# Correlation matrix
correlation_matrix = secondary_dataset.corr(numeric_only=True)
print(f"\nCorrelation Matrix for Secondary dataset:")
correlation_matrix
Correlation Matrix for Secondary dataset:
| | id | platz | jahrgang | ak_plazierung | sex_plazierung | z5 | z10 | z15 | z20 | z25 | z30 | z35 | z40 | halbmarathon | netto | brutto |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.000000 | 1.000000 | -0.213088 | 0.559392 | 0.728643 | 0.909948 | 0.923931 | 0.928243 | 0.931870 | 0.943503 | 0.953334 | 0.957996 | 0.963941 | 0.939431 | 0.974226 | 0.971923 |
| platz | 1.000000 | 1.000000 | -0.213087 | 0.559386 | 0.728640 | 0.909951 | 0.923936 | 0.928250 | 0.931878 | 0.943512 | 0.953345 | 0.958007 | 0.963951 | 0.939440 | 0.974236 | 0.971932 |
| jahrgang | -0.213088 | -0.213087 | 1.000000 | 0.128877 | -0.255340 | -0.199546 | -0.208912 | -0.212135 | -0.213663 | -0.212757 | -0.214436 | -0.210787 | -0.209275 | -0.214350 | -0.216690 | -0.217708 |
| ak_plazierung | 0.559392 | 0.559386 | 0.128877 | 1.000000 | 0.787620 | 0.462249 | 0.463504 | 0.462379 | 0.465644 | 0.477334 | 0.491154 | 0.508157 | 0.524066 | 0.469950 | 0.532678 | 0.532896 |
| sex_plazierung | 0.728643 | 0.728640 | -0.255340 | 0.787620 | 1.000000 | 0.613366 | 0.620509 | 0.621853 | 0.625897 | 0.639990 | 0.656769 | 0.674251 | 0.690539 | 0.631944 | 0.703687 | 0.703083 |
| z5 | 0.909948 | 0.909951 | -0.199546 | 0.462249 | 0.613366 | 1.000000 | 0.989677 | 0.980028 | 0.968188 | 0.958865 | 0.945307 | 0.925610 | 0.914405 | 0.971864 | 0.917924 | 0.919837 |
| z10 | 0.923931 | 0.923936 | -0.208912 | 0.463504 | 0.620509 | 0.989677 | 1.000000 | 0.993414 | 0.984715 | 0.977090 | 0.964838 | 0.946120 | 0.934529 | 0.988732 | 0.938170 | 0.939389 |
| z15 | 0.928243 | 0.928250 | -0.212135 | 0.462379 | 0.621853 | 0.980028 | 0.993414 | 1.000000 | 0.988328 | 0.984583 | 0.972775 | 0.955002 | 0.943805 | 0.992872 | 0.947602 | 0.947936 |
| z20 | 0.931870 | 0.931878 | -0.213663 | 0.465644 | 0.625897 | 0.968188 | 0.984715 | 0.988328 | 1.000000 | 0.986399 | 0.978385 | 0.963695 | 0.952846 | 0.993169 | 0.955834 | 0.955294 |
| z25 | 0.943503 | 0.943512 | -0.212757 | 0.477334 | 0.639990 | 0.958865 | 0.977090 | 0.984583 | 0.986399 | 1.000000 | 0.989390 | 0.975979 | 0.966606 | 0.992985 | 0.970624 | 0.969112 |
| z30 | 0.953334 | 0.953345 | -0.214436 | 0.491154 | 0.656769 | 0.945307 | 0.964838 | 0.972775 | 0.978385 | 0.989390 | 1.000000 | 0.985357 | 0.977844 | 0.985655 | 0.981957 | 0.979793 |
| z35 | 0.957996 | 0.958007 | -0.210787 | 0.508157 | 0.674251 | 0.925610 | 0.946120 | 0.955002 | 0.963695 | 0.975979 | 0.985357 | 1.000000 | 0.981064 | 0.969767 | 0.985297 | 0.982706 |
| z40 | 0.963941 | 0.963951 | -0.209275 | 0.524066 | 0.690539 | 0.914405 | 0.934529 | 0.943805 | 0.952846 | 0.966606 | 0.977844 | 0.981064 | 1.000000 | 0.959579 | 0.989897 | 0.987042 |
| halbmarathon | 0.939431 | 0.939440 | -0.214350 | 0.469950 | 0.631944 | 0.971864 | 0.988732 | 0.992872 | 0.993169 | 0.992985 | 0.985655 | 0.969767 | 0.959579 | 1.000000 | 0.964508 | 0.963714 |
| netto | 0.974226 | 0.974236 | -0.216690 | 0.532678 | 0.703687 | 0.917924 | 0.938170 | 0.947602 | 0.955834 | 0.970624 | 0.981957 | 0.985297 | 0.989897 | 0.964508 | 1.000000 | 0.996977 |
| brutto | 0.971923 | 0.971932 | -0.217708 | 0.532896 | 0.703083 | 0.919837 | 0.939389 | 0.947936 | 0.955294 | 0.969112 | 0.979793 | 0.982706 | 0.987042 | 0.963714 | 0.996977 | 1.000000 |
scatter_fig = px.scatter(primary_dataset, x='halbmarathon', y='netto',
title='Scatter Plot - Half Marathon Time vs Finish Time',
labels={'halbmarathon': 'Half Marathon Time (seconds)', 'netto': 'Finish Time (seconds)'})
scatter_fig.show()
Summary:
The scatter plot shows a strong positive linear relationship: runners with faster half-marathon splits also tend to record faster finish times.
color_palette = px.colors.qualitative.Set1
boxplot_sex_fig = px.box(primary_dataset, x='sex', y='netto',
title='Boxplot of Finish Time vs Gender (Primary Dataset)',
labels={'sex': 'Gender', 'netto': 'Finish Time (seconds)'},
color='sex', color_discrete_sequence=color_palette)
boxplot_sex_fig.show()
Summary:
The boxplot compares the distribution of finish times between male and female runners; the difference between the two distributions is examined formally with a t-test later in the analysis.
boxplot_jahrgang_fig = px.box(primary_dataset, x='jahrgang', y='netto',
title='Boxplot of Finish Time vs Birth Year (Primary Dataset)',
labels={'jahrgang': 'Birth Year', 'netto': 'Finish Time (seconds)'},
color='jahrgang', color_discrete_sequence=color_palette)
boxplot_jahrgang_fig.show()
Summary:
color_palette = px.colors.qualitative.Set1
boxplot_ak_fig = px.box(primary_dataset, x='ak', y='netto',
title='Boxplot of Finish Time vs Age Group (Primary Dataset)',
labels={'ak': 'Age Group', 'netto': 'Finish Time (seconds)'},
color='ak', color_discrete_sequence=color_palette)
boxplot_ak_fig.show()
Summary:
average_time_by_nation = primary_dataset.groupby('nation')['netto'].mean().reset_index()
top_10_nations = average_time_by_nation.nlargest(10, 'netto')
warnings.simplefilter(action='ignore', category=FutureWarning)
plt.figure(figsize=(10, 5))
sns.barplot(x='nation', y='netto', data=top_10_nations, palette='coolwarm')
plt.title('Top 10 Nations with the Highest Average Netto (Finish Time)')
plt.xlabel('Nation')
plt.ylabel('Average Netto (Finish Time)')
plt.show()
Summary:
plt.figure(figsize=(12, 8))
sns.heatmap(primary_dataset.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap (Primary Dataset)')
plt.show()
corr_coefficient, p_value = pearsonr(primary_dataset['netto'], primary_dataset['halbmarathon'])
print(f"\nPrimary dataset: \nPearson correlation coefficient between 'netto' and 'halbmarathon': {corr_coefficient:.4f}")
male_finish_times = primary_dataset[primary_dataset['sex'] == 'M']['netto']
female_finish_times = primary_dataset[primary_dataset['sex'] == 'W']['netto']
t_statistic, p_value_ttest = ttest_ind(male_finish_times, female_finish_times)
print(f"\nT-test for 'sex' and 'netto' in Primary dataset:")
print(f"T-statistic: {t_statistic:.4f}")
Primary dataset:
Pearson correlation coefficient between 'netto' and 'halbmarathon': 0.9601

T-test for 'sex' and 'netto' in Primary dataset:
T-statistic: -56.1901
Summary:
The half-marathon time ('halbmarathon') and the finish time ('netto') have a Pearson correlation coefficient of 0.9601. This high positive correlation (close to 1) indicates a strong linear relationship between the two: as 'halbmarathon' increases, 'netto' also tends to increase.
The t-test on 'sex' and 'netto' compares finish times between male and female competitors. The computed t-statistic is -56.1901; a t-statistic this large in magnitude indicates a significant difference in finish times between male and female athletes.
The heatmap above visualizes the correlation matrix of the primary dataset's numeric variables.
Positive Correlation: the split, half-marathon, and finish times are all strongly positively correlated with one another (most coefficients above 0.9).
Negative Correlation: 'jahrgang' (birth year) correlates negatively with the split and finish times.
Correlation Strength: values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='halbmarathon', y='netto', data=primary_dataset)
plt.title("Relationship between Half Marathon Time vs Finish Time - 2013")
plt.xlabel("halbmarathon (Seconds)")
plt.ylabel("netto (Seconds)")
plt.subplot(1, 2, 2)
sns.scatterplot(x='halbmarathon', y='netto', data=secondary_dataset)
plt.title("Relationship between Half Marathon Time vs Finish Time - 2022")
plt.xlabel("halbmarathon (Seconds)")
plt.ylabel("netto (Seconds)")
plt.tight_layout()
plt.show()
Summary:
Positive Trend: the upward slope of the regression line indicates that, on average, finish time grows with half-marathon time.
Strength of association: the more tightly the data points cluster around the regression line, the stronger the correlation between the variables.
Outliers: points that deviate markedly from the regression line. An outlier represents an individual case that departs from the overall pattern shown in the data.
Both years exhibit a positive linear correlation, indicating that the trend observed in 2013 persists in 2022.
# correlation coefficient func
def calculate_correlation(dataset, x_col, y_col):
correlation_coefficient, _ = pearsonr(dataset[x_col], dataset[y_col])
return correlation_coefficient
# linear regression and get coefficients func
def calculate_coefficients(dataset, x_col, y_col):
X = dataset[[x_col]]
y = dataset[y_col]
model = LinearRegression()
model.fit(X, y)
intercept = model.intercept_
coefficient = model.coef_[0]
return intercept, coefficient
# correlation coefficient and coefficients - primary dataset
correlation_primary = calculate_correlation(primary_dataset, 'halbmarathon', 'netto')
intercept_primary, coefficient_primary = calculate_coefficients(primary_dataset, 'halbmarathon', 'netto')
# correlation coefficient and coefficients - secondary dataset
correlation_secondary = calculate_correlation(secondary_dataset, 'halbmarathon', 'netto')
intercept_secondary, coefficient_secondary = calculate_coefficients(secondary_dataset, 'halbmarathon', 'netto')
print("Primary Dataset:")
print(f"Correlation Coefficient: {correlation_primary:.4f}")
print(f"Intercept: {intercept_primary:.4f}")
print(f"Coefficient: {coefficient_primary:.4f}")
print("\nSecondary Dataset:")
print(f"Correlation Coefficient: {correlation_secondary:.4f}")
print(f"Intercept: {intercept_secondary:.4f}")
print(f"Coefficient: {coefficient_secondary:.4f}")
Primary Dataset:
Correlation Coefficient: 0.9601
Intercept: -685.6404
Coefficient: 2.1784

Secondary Dataset:
Correlation Coefficient: 0.9645
Intercept: -1041.4272
Coefficient: 2.2566
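To make the fitted line concrete, the primary-dataset coefficients from the output above can be plugged into the regression equation to predict a finish time from a half-marathon split (runner id=152's split of 7556 s is used purely as an illustration; the prediction will not match the actual 15235 s exactly):

```python
# Coefficients reported above for the primary (2013) dataset
intercept, coef = -685.6404, 2.1784

def predict_netto(halbmarathon_seconds):
    # Fitted line: netto ≈ intercept + coef * halbmarathon
    return intercept + coef * halbmarathon_seconds

print(round(predict_netto(7556)))  # 15774
```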
Summary:
Correlation: measures the strength and direction of a linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
The correlation coefficient: quantifies this relationship numerically, indicating precisely how closely the variables move together in a linear fashion.
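A minimal illustration of the coefficient's range, on made-up data: a perfectly linear relationship gives r = 1, and flipping the sign of the slope gives r = -1.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos, _ = pearsonr(x, 2 * x + 3)   # perfect positive linear relationship
r_neg, _ = pearsonr(x, -2 * x + 3)  # perfect negative linear relationship
print(round(r_pos, 4), round(r_neg, 4))  # 1.0 -1.0
```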
primary_dataset['netto_first_half'] = primary_dataset['halbmarathon']
primary_dataset['netto_second_half'] = primary_dataset['netto'] - primary_dataset['halbmarathon']
# categorize split type func
def split_type(row):
    # Negative split: the second half is run faster (smaller time) than the first
    if row['netto_second_half'] < row['netto_first_half']:
        return 'Negative Split'
    elif row['netto_second_half'] > row['netto_first_half']:
        return 'Positive Split'
    else:
        return 'Even Split'
primary_dataset['split_type'] = primary_dataset.apply(split_type, axis=1)
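As a standalone check of split classification (negative split = faster second half, per the definition in the introduction), using runner id=152's times from the table above (halbmarathon 7556 s, netto 15235 s):

```python
def classify_split(first_half, second_half):
    # Negative split: the second half is run faster (smaller time) than the first
    if second_half < first_half:
        return 'Negative Split'
    if second_half > first_half:
        return 'Positive Split'
    return 'Even Split'

first_half = 7556            # halbmarathon split, in seconds
second_half = 15235 - 7556   # 7679 s: a slower second half
print(classify_split(first_half, second_half))  # Positive Split
```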
plt.figure(figsize=(12, 6))
elite_threshold = primary_dataset['netto'].quantile(0.10)
primary_dataset['is_elite'] = primary_dataset['netto'] <= elite_threshold
plt.subplot(1, 2, 1)
sns.violinplot(x='split_type', y='netto', data=primary_dataset, palette='pastel', inner='quartile')
plt.title("Impact of Pacing on Finish Times")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.subplot(1, 2, 2)
sns.violinplot(x='split_type', y='netto', hue='is_elite', data=primary_dataset, palette='Set1', split=True, inner='quartile')
plt.title("Impact of Pacing on Finish Times by Elite Status")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.legend(title='Elite Status')
plt.tight_layout()
plt.show()
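The elite cutoff above is the 10th percentile of 'netto'. A minimal sketch, on hypothetical finish times, of how pandas' `quantile(0.10)` interpolates and how the boolean flag follows:

```python
import pandas as pd

# Hypothetical finish times in seconds; the fastest ~10% count as elite
netto = pd.Series([7600, 9000, 10000, 11000, 12000,
                   13000, 14000, 15000, 16000, 17000])
threshold = netto.quantile(0.10)  # linear interpolation between sorted values
is_elite = netto <= threshold
print(threshold, int(is_elite.sum()))  # 8860.0 1
```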
Summary:
fig_strip = px.strip(primary_dataset, x='split_type', y='netto', color='sex',
                     title="Impact of Pacing on Finish Times by Gender (Strip Plot)",
                     labels={'split_type': 'Split Type', 'netto': 'Netto Finish Time (seconds)'},
                     category_orders={'sex': ['M', 'W']})
fig_strip.update_layout(legend=dict(title=dict(text='Gender')))
fig_strip.show()
Summary:
The pattern above indicates that runners with a positive split (slowing down in the second half) generally record longer netto finish times than those who achieve a negative or even split.
custom_palette = ['#1f78b4', '#33a02c', '#e31a1c', '#ff7f00', '#6a3d9a', '#b15928']
fig_age_group = px.box(primary_dataset, x='split_type', y='netto', color='ak',
title="Impact of Pacing on Finish Times by Age Group",
labels={'split_type': 'Split Type', 'netto': 'Netto Finish Time (seconds)'},
color_discrete_sequence=custom_palette)
fig_age_group.update_layout(legend=dict(title=dict(text='Age Group')))
fig_age_group.show()
Summary:
Across age groups, the 'Positive Split' category typically shows higher netto finish times than 'Negative Split', followed by 'Even Split'. This suggests that runners who slow down in the second half generally finish with longer times than those who run a negative or even split, regardless of age group.
elite_count = primary_dataset.groupby(['sex', 'is_elite']).size().reset_index(name='count')
elite_count_pivot = elite_count.pivot(index='sex', columns='is_elite', values='count').fillna(0)
labels = ['Non-Elite Men', 'Elite Men', 'Non-Elite Women', 'Elite Women']
colors = ['#ff9999', '#ffcc99', '#66b3ff', '#99ff99']
explode = (0.1, 0, 0.1, 0)
fig, ax = plt.subplots()
ax.pie(elite_count_pivot.values.flatten(), labels=labels, autopct='%1.1f%%', colors=colors, startangle=90, pctdistance=0.85, explode=explode)
ax.axis('equal')
plt.title("Percentage of Elite and Non-Elite Athletes by Gender")
plt.legend(title='Elite Status', loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
Summary:
fig = px.violin(primary_dataset, x='ak', y='netto', color='is_elite', box=True, points="all",
title="Impact of Pacing on Finish Times by Age Group and Elite Status",
labels={'netto': 'Netto Finish Time (seconds)', 'ak': 'Age Group', 'is_elite': 'Elite Status'},
category_orders={'ak': sorted(primary_dataset['ak'].unique())},
width=800, height=600,
facet_col="sex", facet_col_wrap=2,
violinmode="overlay",
hover_data={'netto': True, 'is_elite': True, 'sex': True, 'ak': True, 'split_type': True},
template="plotly"
)
fig.show()
Summary:
For example:
Runner - MEN:
Runner - WOMEN:
Here, we analysed the impact of pacing on 'netto' across age group and elite status. Elite runners in age group 'H' (roughly ages 20-29) appear to favour a negative-split strategy to complete the race; elite runners are found across the age groups from 20 to 65, and the remaining runners are non-elite.
primary_dataset['percentage_slowdown'] = ((primary_dataset['netto_second_half'] - primary_dataset['netto_first_half']) / primary_dataset['netto_first_half']) * 100
bonk_threshold = 30
primary_dataset['has_bonked'] = primary_dataset['percentage_slowdown'] >= bonk_threshold
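As a worked example of the bonk criterion (a second half at least 30% slower than the first), again using runner id=152's splits, which fall well below the threshold:

```python
def percentage_slowdown(first_half, second_half):
    # Relative slowdown of the second half vs. the first, in percent
    return (second_half - first_half) / first_half * 100

slowdown = percentage_slowdown(7556, 15235 - 7556)  # second half = 7679 s
print(round(slowdown, 2), slowdown >= 30)  # 1.63 False
```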
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='sex', hue='has_bonked', data=primary_dataset, palette='Set2')
for p in ax.patches:
height = p.get_height()
ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2., height),
ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
textcoords='offset points')
plt.title("Bonking Across Gender with Counts")
plt.xlabel("Gender")
plt.ylabel("Count - Runners in race")
plt.show()
Summary:
The countplot shows the distribution of bonking events by gender, giving a view of how this occurrence differs between men and women runners.
Here we find that 980 men and 122 women showed a significant drop in pace (a bonk) of at least 30% in the second half; the count for men is substantially higher than for women.
fig = px.histogram(primary_dataset, x='ak', color='has_bonked', barmode='stack',
category_orders={'ak': sorted(primary_dataset['ak'].unique())},
labels={'ak': 'Age Group', 'has_bonked': 'Bonked'},
color_discrete_sequence=['#1f78b4', '#e31a1c'],
title="Bonking Across Age Group",
facet_col="sex", facet_col_wrap=2,
width=800, height=600)
fig.update_layout(xaxis_title="Age Group", yaxis_title="Count - Runners", legend_title="Bonked",
showlegend=True, template="plotly",
plot_bgcolor='rgba(255, 255, 255, 0.7)')
fig.show()
Summary:
fig = px.scatter(primary_dataset, x='netto', y='percentage_slowdown', color='has_bonked',
title="Bonking Across Finish Time",
labels={'netto': 'Netto Finish Time (seconds)', 'percentage_slowdown': 'Percentage Slowdown'},
color_discrete_map={False: 'blue', True: 'red'})
fig.show()
Summary:
The concentration of blue points between -50% and 50% indicates that most runners who did not bonk slowed down within this range.
Here, I conclude that most bonked runners (shown in red) showed a slowdown of between 30% and 50% of their first-half pace.
# Scatter plot: first half vs. second half by gender
fig = go.Figure()
fig.add_trace(go.Scatter(x=primary_dataset[primary_dataset['sex'] == 'M']['netto_first_half'],
y=primary_dataset[primary_dataset['sex'] == 'M']['netto_second_half'],
mode='markers', name='Male', marker=dict(color='#1f78b4')))
fig.add_trace(go.Scatter(x=primary_dataset[primary_dataset['sex'] == 'W']['netto_first_half'],
y=primary_dataset[primary_dataset['sex'] == 'W']['netto_second_half'],
mode='markers', name='Female', marker=dict(color='#33a02c')))
fig.update_layout(title="First Half vs. Second Half by Gender",
xaxis_title="Netto Finish Time - First Half (seconds)",
yaxis_title="Netto Finish Time - Second Half (seconds)",
template="plotly", width=800, height=400)
fig.show()
# Scatter plot: first half vs. second half, bonked runners highlighted
fig_bonked = go.Figure()
fig_bonked.add_trace(go.Scatter(x=primary_dataset[~primary_dataset['has_bonked']]['netto_first_half'],
y=primary_dataset[~primary_dataset['has_bonked']]['netto_second_half'],
mode='markers', name='Not Bonked', marker=dict(color='#1f78b4')))
fig_bonked.add_trace(go.Scatter(x=primary_dataset[primary_dataset['has_bonked']]['netto_first_half'],
y=primary_dataset[primary_dataset['has_bonked']]['netto_second_half'],
mode='markers', name='Bonked', marker=dict(color='#e31a1c')))
fig_bonked.update_layout(title="First Half vs. Second Half with Bonked Runners Marked",
xaxis_title="Netto Finish Time - First Half (seconds)",
yaxis_title="Netto Finish Time - Second Half (seconds)",
template="plotly", width=800, height=400)
fig_bonked.show()
Summary:
First Half vs. Second Half by Gender (M, W):
First Half vs. Second Half with Bonked Runners Marked:
These scatter plots show how runners perform in the first and second halves: the first highlights gender differences, and the second highlights runners who bonked. Together they help identify bonking and reveal the pacing patterns between halves, offering insight into performance-influencing variables.
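The visual relationship between halves could also be quantified with the Pearson correlation, which is already imported at the top of the notebook; a sketch on illustrative numbers (not the actual race data):

```python
from scipy.stats import pearsonr

# Illustrative first- and second-half times in seconds (not real race data).
first_half = [5400, 6000, 6600, 7200, 7800]
second_half = [5600, 6300, 7000, 7900, 8600]

# r close to 1 means runners who are fast in the first half
# tend to be fast in the second half as well.
r, p_value = pearsonr(first_half, second_half)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```

On the real columns this would be `pearsonr(primary_dataset['netto_first_half'], primary_dataset['netto_second_half'])`.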
scatter_fig = px.scatter(secondary_dataset, x='halbmarathon', y='netto',
title='Scatter Plot - Half Marathon Time vs Finish Time (Secondary dataset)',
labels={'halbmarathon': 'Half Marathon Time (seconds)', 'netto': 'Finish Time (seconds)'})
scatter_fig.show()
Summary:
Here I use a scatter plot, as previously in Task 2 on the primary dataset, to see whether there is a pattern or correlation between the half marathon time and the total finish time for runners in the secondary dataset.
This is helpful for understanding the structure of the dataset by pointing out possible linear relationships, outliers, or clusters of data points in particular areas.
The primary dataset (2013) has half marathon times ranging from about 4,000 to 12,000 seconds, while the secondary dataset (2022) ranges from about 4,000 to 14,000 seconds, an upper end roughly 2,000 seconds higher.
The aim is to check whether the main findings from Task 2 hold in a similar way for this secondary 2022 dataset.
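The `netto` and `halbmarathon` columns are used here as seconds, although the raw CSV (see the preview above) stores them as `HH:MM:SS` strings; presumably they were converted earlier in the notebook, for example with `pd.to_timedelta` — a sketch of that assumed step:

```python
import pandas as pd

# Raw times come as 'HH:MM:SS' strings, as in the CSV preview.
raw = pd.Series(['02:05:56', '04:13:55'])

# Convert to total seconds via Timedelta.
seconds = pd.to_timedelta(raw).dt.total_seconds()
print(seconds.tolist())  # → [7556.0, 15235.0]
```

Applied to the dataset this would be e.g. `secondary_dataset['netto'] = pd.to_timedelta(secondary_dataset['netto']).dt.total_seconds()`.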
secondary_dataset['pacing_seconds'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
fig = px.box(secondary_dataset, x='ak', y='pacing_seconds', color='sex',
title='Pacing Analysis by Gender and Age Group (Secondary Dataset)')
fig.update_layout(xaxis_title='Age Group', yaxis_title='Pacing (seconds)', legend_title='Gender')
fig.show()
Summary:
The box plots suggest that both the primary and secondary datasets show a similar pacing pattern across gender and age groups.
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=['Primary 2013', 'Secondary 2022'])
primary_dataset['pacing_seconds'] = primary_dataset['netto'] - primary_dataset['halbmarathon']
fig_primary = px.histogram(primary_dataset, x='pacing_seconds', nbins=20)
secondary_dataset['pacing_seconds'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
fig_secondary = px.histogram(secondary_dataset, x='pacing_seconds', nbins=20)
fig.add_trace(fig_primary['data'][0], row=1, col=1)
fig.add_trace(fig_secondary['data'][0], row=1, col=2)
fig.update_layout(title_text='Comparison of Pacing Times Between Datasets', showlegend=False)
fig.update_xaxes(title_text='Pacing Time (seconds)', row=1, col=1)
fig.update_xaxes(title_text='Pacing Time (seconds)', row=1, col=2)
fig.update_yaxes(title_text='Count - Runners', row=1, col=1)
fig.show()
Summary:
Here we observe that the two distributions largely overlap between the 2013 and 2022 datasets; the main difference appears above roughly 10,000 seconds of pacing time.
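Beyond eyeballing the histograms, the two pacing distributions could be compared formally with `ttest_ind`, which is already imported at the top of the notebook; a sketch on synthetic samples standing in for the two years' `pacing_seconds` columns:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Synthetic second-half pacing times (seconds) standing in for the
# 2013 and 2022 pacing_seconds columns; means chosen for illustration only.
pacing_2013 = rng.normal(loc=8200, scale=900, size=500)
pacing_2022 = rng.normal(loc=8800, scale=950, size=500)

# Welch's t-test (does not assume equal variances).
t_stat, p_value = ttest_ind(pacing_2013, pacing_2022, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that the difference in mean pacing time between the two years is unlikely to be due to chance.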
secondary_dataset['netto_first_half'] = secondary_dataset['halbmarathon']
secondary_dataset['netto_second_half'] = secondary_dataset['netto'] - secondary_dataset['halbmarathon']
def split_type(row):
    # Positive split: first half faster (second half slower); negative split: the reverse.
    if row['netto_first_half'] < row['netto_second_half']:
        return 'Positive Split'
    elif row['netto_first_half'] > row['netto_second_half']:
        return 'Negative Split'
    else:
        return 'Even Split'
secondary_dataset['split_type'] = secondary_dataset.apply(split_type, axis=1)
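As a sanity check, the split classification can be exercised on hand-made rows; a self-contained sketch using the conventions from the introduction (positive split = first half faster, negative split = second half faster):

```python
import pandas as pd

def classify_split(first_half, second_half):
    # Positive split: second half slower than the first; negative: the reverse.
    if first_half < second_half:
        return 'Positive Split'
    elif first_half > second_half:
        return 'Negative Split'
    return 'Even Split'

toy = pd.DataFrame({'netto_first_half': [7000, 7500, 7200],
                    'netto_second_half': [7600, 7100, 7200]})
toy['split_type'] = toy.apply(
    lambda r: classify_split(r['netto_first_half'], r['netto_second_half']), axis=1)
print(toy['split_type'].tolist())
# → ['Positive Split', 'Negative Split', 'Even Split']
```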
plt.figure(figsize=(12, 6))
elite_threshold = secondary_dataset['netto'].quantile(0.10)
secondary_dataset['is_elite'] = secondary_dataset['netto'] <= elite_threshold
plt.subplot(1, 2, 1)
sns.violinplot(x='split_type', y='netto', data=secondary_dataset, palette='pastel', inner='quartile')
plt.title("Impact of Pacing on Finish Times")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.subplot(1, 2, 2)
sns.violinplot(x='split_type', y='netto', hue='is_elite', data=secondary_dataset, palette='Set1', split=True, inner='quartile')
plt.title("Impact of Pacing on Finish Times by Elite Status")
plt.xlabel("Split Type")
plt.ylabel("Netto Finish Time (seconds)")
plt.legend(title='Elite Status')
plt.tight_layout()
plt.show()
Summary:
This secondary 2022 dataset yields approximately the same analysis as created above for the primary 2013 dataset. There are only slight differences in the even, negative, and positive split distributions.
elite_count = secondary_dataset.groupby(['sex', 'is_elite']).size().reset_index(name='count')
elite_count_pivot = elite_count.pivot(index='sex', columns='is_elite', values='count').fillna(0)
labels = ['Non-Elite Men', 'Elite Men', 'Non-Elite Women', 'Elite Women']
colors = ['#ff9999', '#ffcc99', '#66b3ff', '#99ff99']
explode = (0.1, 0, 0.1, 0)
fig, ax = plt.subplots()
ax.pie(elite_count_pivot.values.flatten(), labels=labels, autopct='%1.1f%%', colors=colors, startangle=90, pctdistance=0.85, explode=explode)
ax.axis('equal')
plt.title("Percentage of Elite and Non-Elite Athletes by Gender")
plt.legend(title='Elite Status', loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
Summary:
From Task 2, the corresponding figures for the 2013 primary dataset were:
Elite Athletes:
Non-Elite Athletes:
Men are more likely than women to appear in the non-elite category in both datasets. The distribution of participants in the elite category remains consistent, with only a small percentage of participants.
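The elite/non-elite breakdown per gender can also be obtained in a single step with `pd.crosstab`, an alternative to the groupby/pivot combination used for the pie chart above; a sketch on toy rows:

```python
import pandas as pd

# Toy runners with gender and elite flag (column values mirror the notebook's).
toy = pd.DataFrame({'sex': ['M', 'M', 'W', 'M', 'W'],
                    'is_elite': [True, False, False, False, True]})

# Rows: sex; columns: is_elite (False/True); cells: counts.
counts = pd.crosstab(toy['sex'], toy['is_elite'])
print(counts)
```

On the real data this would be `pd.crosstab(secondary_dataset['sex'], secondary_dataset['is_elite'])`.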
secondary_dataset['percentage_slowdown'] = ((secondary_dataset['netto_second_half'] - secondary_dataset['netto_first_half']) / secondary_dataset['netto_first_half']) * 100
bonk_threshold = 30
secondary_dataset['has_bonked'] = secondary_dataset['percentage_slowdown'] >= bonk_threshold
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='sex', hue='has_bonked', data=secondary_dataset, palette='Set2')
for p in ax.patches:
    height = int(p.get_height())
    ax.annotate(f'{height}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='baseline', fontsize=12, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.title("Bonking Across Gender with Counts")
plt.xlabel("Gender")
plt.ylabel("Count - Runners")
plt.show()
Summary:
From Task 3, the corresponding bonking counts for the 2013 primary dataset were:
2013 Dataset:
* Men:
  * True (Bonked): 980
  * False (Not bonked): 26,539
* Women:
  * True (Bonked): 122
  * False (Not bonked): 882
2022 Dataset:
* Men:
  * True (Bonked): 1,841
  * False (Not bonked): 21,389
* Women:
  * True (Bonked): 353
  * False (Not bonked): 11,141
Overall, the 2022 dataset has a larger total number of participants, and the number of bonked runners increased for both genders.
To summarize, the above analysis of the marathon datasets from 2013 and 2022 provides significant insight into participant characteristics, the impact of pacing on netto finish times, and the dynamics of performance. Notable findings include the impact of pacing strategies, the differences in finish times between age groups and genders as represented by box and violin plots, and the distribution of elite and non-elite athletes as shown by pie charts. Positive splits, in which runners slow in the second half, are associated with longer netto finish times. The bonking analysis shows gender differences in significant second-half pace drops. Scatter plots, such as half marathon time versus netto finish time, effectively visualize relationships and aid pattern recognition. The 2022 dataset shows a higher number of participants, while participant demographics and performance trends are consistent across datasets. With the grasp of marathon dynamics these analyses provide, race strategy decisions for both years can be made in an informed way.